Debug Node Registration Issues

This guide helps you debug issues that occur when registering nodes (devices) with a private ESP RainMaker deployment. Node registration is typically performed using the ESP RainMaker Admin CLI, which generates device certificates and bulk-registers nodes via an AWS Batch job.

Before you start, have these ready:

The request_id returned after running certs devicecert register
The node_id(s) of the affected nodes
The admin user ID (email) used to run the registration
Approximate time when the registration was triggered

Step 1: Identify Your Symptom

Symptom	Go to
Admin CLI `generate` command fails with an error	Admin CLI — Certificate Generation Errors
Admin CLI `register` command fails before submitting the job	Admin CLI — Registration Submission Errors
Registration job submitted but no confirmation email received	Registration Job Submitted — No Email or Status Unknown
`getcertstatus` shows `FAILURE` or some nodes failed	Registration Job Failed or Partial Failures
Registration job is stuck in `REQUESTED` or `INPROGRESS` for too long	Registration Job Stuck or Timed Out
Nodes registered but not visible on the RainMaker Dashboard	Nodes Not Visible on the RainMaker Dashboard
Node is registered but the device cannot connect to the cloud	Node Registered but Device Cannot Connect
Getting a specific error code (106xxx / 200xxx)	Error Code Reference

Node Registration Overview

Understanding the flow helps you identify at which stage a failure occurred.

Step 1  Admin CLI generates certificates locally → node_certs.csv
Step 2  CLI calls GET /admin/node_certificates/register → gets S3 pre-signed URL + request_id
Step 3  CLI uploads node_certs.csv to S3
Step 4  CLI calls POST /admin/node_certificates/register → triggers AWS Batch job
Step 5  AWS Batch processes each node: creates IoT Thing, registers certificate,
        attaches policy, writes to DynamoDB nodes_v3 table
Step 6  Admin receives email with job summary
Step 7  Device boots, connects to MQTT, publishes config → node visible on dashboard

Admin CLI — Certificate Generation Errors

These errors occur when running python rainmaker_admin_cli.py certs devicecert generate.

Check 1: Verify the command arguments

Error Message	Cause	Fix
`"Maximum of 50,000 nodes generation supported in a single request."`	`--count` exceeds 50,000	Split into multiple batches with `--count ≤ 50000`
`"<count> must be > 0"`	Count is zero or negative	Provide a valid `--count` value
`"'node_id' column not found in file"`	`--inputfile` CSV is missing the `node_id` column	Ensure the input CSV has a header row with `node_id` as the column name
`"CA key file is not provided"` / `"CA cert file is not provided"`	Only one of `--cacertfile` / `--cakeyfile` was given	Provide both `--cacertfile` and `--cakeyfile` together
`"At least one of the following must be provided: --count, ADDITIONAL_VALUES, --inputfile"`	No node count source specified	Provide `--count`, `--inputfile`, or configure `ADDITIONAL_VALUES` in `config/binary_config.ini`
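
When more nodes are needed than a single request allows, the total has to be divided across several `generate` runs. The arithmetic can be sketched as below; `split_count` is a hypothetical helper for illustration, not part of the Admin CLI, and the 50,000 default comes from the error message above.

```python
def split_count(total, limit=50000):
    """Return per-run --count values, each between 1 and limit, that sum to total."""
    if total <= 0:
        raise ValueError("total must be > 0")
    full_runs, remainder = divmod(total, limit)
    counts = [limit] * full_runs
    if remainder:
        counts.append(remainder)
    return counts
```

For example, `split_count(120000)` yields three runs of 50,000, 50,000, and 20,000 nodes.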

Check 2: Verify the output directory

After a successful generate, confirm these files exist in the output directory:

<outdir>/<date>/Mfg-<N>/
  common/
    node_certs.csv    ← required for the next `register` step
    ca.crt            ← CA certificate
    node_ids.csv      ← list of generated node IDs
    endpoint.txt      ← MQTT broker hostname
  node_details/
    node-<idx>-<node_id>/
      node.crt        ← device certificate
      node.key        ← device private key

If any of these files are missing, re-run generate. If the output directory is missing entirely, the tool failed before writing any files — check for Python exceptions in the terminal output.
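
For large batches, checking the layout by hand is tedious. A minimal sketch of an automated check follows, assuming the directory layout shown above; the helper name `find_missing_files` is hypothetical and not part of the Admin CLI.

```python
from pathlib import Path

# File names taken from the output layout described above.
COMMON_FILES = ("node_certs.csv", "ca.crt", "node_ids.csv", "endpoint.txt")
NODE_FILES = ("node.crt", "node.key")


def find_missing_files(mfg_dir):
    """Return paths (relative to mfg_dir) of expected files that are absent."""
    root = Path(mfg_dir)
    missing = [
        f"common/{name}"
        for name in COMMON_FILES
        if not (root / "common" / name).is_file()
    ]
    details = root / "node_details"
    if not details.is_dir():
        missing.append("node_details/")
        return missing
    # Each per-node directory should hold a certificate and a private key.
    for node_dir in sorted(p for p in details.iterdir() if p.is_dir()):
        for name in NODE_FILES:
            if not (node_dir / name).is_file():
                missing.append(f"node_details/{node_dir.name}/{name}")
    return missing
```

Call it with the path to the `Mfg-<N>` directory; an empty list means every expected file was found.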

Note

The node_certs.csv in common/ is the input file for the register command. Use the full path when calling register --inputfile.

Admin CLI — Registration Submission Errors

These errors occur when running python rainmaker_admin_cli.py certs devicecert register.

Check 1: Validate the input CSV

Error Message	Cause	Fix
`"Input file is invalid. Please provide file containing the certificates"`	CSV has no `certs` column or all cert values are empty	Use the `node_certs.csv` generated by the `generate` step
`"Column count mismatch in row N"`	The CSV has inconsistent column counts	Open the CSV in a text editor and fix the row with index `N`
`"Certificate CN 'X' does not match node_id 'Y'"`	The certificate's Common Name does not match the `node_id` column	Regenerate the certificates — CN mismatch means the cert and node ID are from different batches
`"Invalid CSV file"` (error 106026)	CSV format is malformed	Validate the CSV with a CSV linter; check for unescaped quotes or missing commas
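
Several of these problems can be caught locally before calling `register`. The sketch below checks only the structural conditions listed in the table (required columns, consistent column counts, non-empty `certs` values). It is an illustration, not the CLI's own validation: the helper name `check_register_csv` is hypothetical, it assumes the `node_id` column variant of the CSV, and it does not verify that each certificate's CN matches the node ID.

```python
import csv


def check_register_csv(path):
    """Return a list of structural problems found in a node_certs.csv-style file."""
    problems = []
    # newline="" lets the csv module handle multi-line quoted PEM values.
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    for column in ("node_id", "certs"):
        if column not in header:
            problems.append(f"missing column: {column}")
    # Data rows are numbered from 1; the server's row numbering may differ.
    for index, row in enumerate(rows[1:], start=1):
        if len(row) != len(header):
            problems.append(
                f"row {index}: expected {len(header)} columns, found {len(row)}"
            )
    if "certs" in header:
        position = header.index("certs")
        values = [row[position] for row in rows[1:] if len(row) > position]
        if not any(value.strip() for value in values):
            problems.append("all certs values are empty")
    return problems
```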

Check 2: Validate tags and policies

Error Message	Cause	Fix
`"Invalid tags specified by user. Check tags format."`	Tags are not in `key:value` format	Use `--tags key1:value1,key2:value2`
`"Invalid tags specified by user. Check whether the tags are referencing the proper column names."`	A tag references a CSV column that doesn't exist	Ensure the column name in `--tags key:@column_name` exactly matches a column in the CSV
`"--node_policies option cannot be used together with --update_nodes."`	Conflicting flags	Remove `--node_policies` when using `--update_nodes`
`"Invalid value for --node_policies"`	Unknown policy name	Valid values are `mqtt` and `videostream`
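
The two tag rules above (`key:value` pairs, and `@column_name` references that must match a CSV column) can be checked before submitting. The following is a sketch under those stated rules only; `check_tags` is a hypothetical helper, and the server may apply additional validation that is not modeled here.

```python
def check_tags(tags, csv_columns):
    """Return a list of problems with a --tags string such as 'k1:v1,k2:@col'."""
    problems = []
    for pair in tags.split(","):
        key, sep, value = pair.partition(":")
        if not sep or not key.strip() or not value.strip():
            problems.append(f"not in key:value format: {pair!r}")
            continue
        value = value.strip()
        # A value starting with '@' refers to a column in the input CSV.
        if value.startswith("@") and value[1:] not in csv_columns:
            problems.append(f"unknown column referenced: {value[1:]!r}")
    return problems
```

Pass the header row of the input CSV as `csv_columns`; an empty result means both rules are satisfied.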

Check 3: Verify connectivity and authentication

Error Message	Cause	Fix
`"Could not connect. Please check your Internet connection."`	Admin CLI cannot reach the RainMaker backend	Check your internet connection, then verify that the server endpoint configured via `account serverconfig` is correct
`"HTTP Request timed out."`	Request took longer than 30 seconds	Retry. If this persists, check if the backend is reachable
`"Failed to upload Device Certificates"`	S3 pre-signed URL upload failed	The pre-signed URL may have expired (1-hour validity). Re-run `register` to get a fresh URL
`"Request to register device certificate failed"`	The `POST /admin/node_certificates/register` API call failed	Check the exact HTTP error code in the output. Run with verbose logging if available
`"Unable to verify SSL certificate."`	TLS verification failed	Verify that `rmaker_admin_lib/server_cert/server_cert.pem` is the correct certificate for your deployment

Tip

When you successfully submit the registration job, the CLI prints a request_id. Save this value — you need it to check the job status later using getcertstatus --requestid <request_id>.

Registration Job Submitted — No Email or Status Unknown

If the registration job was submitted but you haven't received a confirmation email, or the status is unclear, follow these steps.

Step 1: Check the job status using the CLI

Run:

python rainmaker_admin_cli.py certs devicecert getcertstatus --requestid <request_id>

success → Job completed. All nodes registered. If nodes are still not visible, see Nodes Not Visible on the RainMaker Dashboard.
in_progress → Job is still running. Wait and check again. Large batches can take up to 10 hours.
failure → Job failed. See Registration Job Failed or Partial Failures.
No output / error → The request_id may be invalid, or the entry expired in DynamoDB (entries are kept for a limited time). Verify the request_id and check DynamoDB directly (Step 2).

Step 2: Check the request record in DynamoDB

Go to AWS Console → DynamoDB → Tables → admin_node_registration_requests.

Query with:

Partition key (user_id): the admin user's Cognito user ID
Sort key (request_id): the request ID from the CLI

What to look for:

Field	What it tells you
`status`	Current job state: `REQUESTED`, `INPROGRESS`, `SUCCESS`, `FAILURE`
`total_count`	Total nodes in the uploaded CSV
`completed_count`	Nodes successfully registered so far
`failed_count`	Nodes that failed registration
`request_timestamp`	When the job was submitted

If no entry is found with that request_id, the request was never recorded in DynamoDB. The POST /admin/node_certificates/register API call likely failed silently. Re-run the register command.
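
The count fields in the request record can be combined into a quick progress summary. The sketch below works on a plain dict holding the fields from the table above. It assumes that `completed_count` and `failed_count` never count the same node, so the remainder is the number of nodes not yet processed; `summarize_request` is a hypothetical helper.

```python
def summarize_request(record):
    """Summarize a registration request record given as a dict of field -> value."""
    total = int(record["total_count"])
    completed = int(record.get("completed_count", 0))
    failed = int(record.get("failed_count", 0))
    return {
        "status": record["status"],
        # Nodes that have neither succeeded nor failed yet (assumption above).
        "remaining": total - completed - failed,
        "failed": failed,
        # SUCCESS and FAILURE are the terminal states listed in the table.
        "done": record["status"] in ("SUCCESS", "FAILURE"),
    }
```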

Step 3: Check the Lambda log for submission errors

Go to CloudWatch → Logs Insights, select /aws/lambda/esp-CertificateRegister, and run:

fields @timestamp, @message
| filter @message like "<request_id>"
| sort @timestamp asc

Look for:

Successful job submission: message containing "Submitted batch job" or the job ID
Any error messages indicating why the submission failed
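
The same query can be run from a script instead of the console. The sketch below takes a CloudWatch Logs client as a parameter (for example one created with `boto3.client("logs")`, assuming boto3 is installed and credentials are configured) and uses its `start_query` and `get_query_results` calls. The function name, the 24-hour default window, and the polling interval are illustrative choices, not values from this guide.

```python
import time


def run_insights_query(logs_client, log_group, needle, hours=24, poll_seconds=1.0):
    """Run a Logs Insights query for messages containing needle; return rows as dicts."""
    end = int(time.time())
    query = (
        "fields @timestamp, @message "
        f'| filter @message like "{needle}" '
        "| sort @timestamp asc"
    )
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=end - hours * 3600,
        endTime=end,
        queryString=query,
    )
    # Poll until the query leaves the Scheduled/Running states.
    while True:
        response = logs_client.get_query_results(queryId=started["queryId"])
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each result row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
```

For this step, pass `/aws/lambda/esp-CertificateRegister` as `log_group` and the request ID as `needle`.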

Step 4: Check if the confirmation email was blocked

The confirmation email is sent via AWS SES. If the sender identity is not verified in SES for your deployment, emails may be silently dropped. The CLI checks SES status during register, so look for any SES warning in its output.

Also check AWS Console → SES → Verified identities to confirm the sender email is verified. If it is not, verify it and re-run the registration job.

Registration Job Failed or Partial Failures

Step 1: Get the overall failure summary

Check the status via CLI or DynamoDB as described above. Note the failed_count and completed_count fields in the admin_node_registration_requests table.

Step 2: Find which specific nodes failed

Go to AWS Console → DynamoDB → Tables → node_manufacturing_errors.

Query with partition key (request_id): the request ID.

This table contains one entry per failed node, with fields:

node_id — which node failed
error — the error message from AWS IoT Core or the batch container
request_id — links back to the registration job
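
For jobs with many failures, the table can be read programmatically. The sketch below takes a low-level DynamoDB client as a parameter (for example `boto3.client("dynamodb")`, assuming boto3 is installed and credentials are configured) and follows pagination via `LastEvaluatedKey`. It assumes that `request_id`, `node_id`, and `error` are stored as string attributes; `list_failed_nodes` is a hypothetical helper.

```python
def list_failed_nodes(dynamodb_client, request_id, table="node_manufacturing_errors"):
    """Return {node_id: error} for every failed node recorded for request_id."""
    failures = {}
    kwargs = {
        "TableName": table,
        "KeyConditionExpression": "request_id = :rid",
        "ExpressionAttributeValues": {":rid": {"S": request_id}},
    }
    while True:
        page = dynamodb_client.query(**kwargs)
        for item in page.get("Items", []):
            # Assumes string-typed attributes ("S") for node_id and error.
            failures[item["node_id"]["S"]] = item.get("error", {}).get("S", "")
        if "LastEvaluatedKey" not in page:
            return failures
        # More results remain; continue from where the last page ended.
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```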

Step 3: Check the AWS Batch job logs

The bulk registration runs inside an AWS Batch container. The container logs are the most detailed source of per-node errors.

Go to AWS Console → Batch → Jobs.
Filter by Job queue: thing-certificate-registration.
Find your job by checking the submission time (matches request_timestamp in DynamoDB).
Click the job → click Log stream to open the CloudWatch log stream.

The log stream is under the log group /aws/batch/job. Each node's registration attempt is logged here with the outcome.

What to look for in Batch logs:

Log message	Meaning
`"Thing already exists"`	A node with this ID is already registered. Use the `--force` flag to allow re-registration
`"Certificate is already Provisioned"`	The same certificate was registered before. Use `--force`
`"Error in registering certificate"`	The certificate PEM is malformed or invalid. Regenerate the certificate for this node
`"Invalid Certificate"`	Certificate format error. Check for truncated PEM data in the CSV
`"Error in creating thing"`	AWS IoT Core `CreateThing` failed. Check IAM role permissions for the Batch job
`"Node limit exceeded"`	Your deployment's licensed node count is exhausted. Contact Espressif to increase the limit

Step 4: Re-register failed nodes

After identifying and fixing the root cause:

Extract the failed node_id values from the node_manufacturing_errors table.
Create a new CSV containing only the failed nodes (with their certificates from node_details/).
Re-run register --inputfile <new_csv> --force to register them without failing on any already-registered nodes.
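
One way to build the retry CSV is to filter the original `node_certs.csv` down to the failed node IDs, which keeps the header and column layout unchanged. This is a sketch of that approach, offered as an alternative to reassembling rows from `node_details/`; `write_retry_csv` is a hypothetical helper and assumes the source CSV has a `node_id` column.

```python
import csv


def write_retry_csv(source_csv, failed_node_ids, retry_csv):
    """Copy the header and the rows of failed nodes into retry_csv; return the row count."""
    wanted = set(failed_node_ids)
    written = 0
    with open(source_csv, newline="") as src, open(retry_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["node_id"] in wanted:
                writer.writerow(row)
                written += 1
    return written
```

Compare the returned count with the number of failed node IDs; a lower number means some IDs were not present in the source file.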

Tip

The --force flag tells the server to skip duplicate node errors and continue registering remaining nodes. Use it when re-running a partially failed job.

Registration Job Stuck or Timed Out

The AWS Batch job has a maximum timeout of 10 hours (36000 seconds). For very large batches, the job can run close to this limit.

Step 1: Check the AWS Batch job status

Go to AWS Console → Batch → Jobs.
Filter by job queue thing-certificate-registration.
Find the job matching your request_id (visible in the job name or environment variables).

Job Status	Meaning
`SUBMITTED` / `PENDING`	Job has been accepted but is not yet ready to run (still being evaluated by the scheduler, or waiting on a dependency)
`RUNNABLE`	Job is waiting for compute capacity in the environment
`STARTING` / `RUNNING`	Job is actively processing
`SUCCEEDED`	All nodes processed
`FAILED`	Container exited with a non-zero code or hit the 10-hour timeout

If the job stays in RUNNABLE for more than 10–15 minutes, the compute environment may not have capacity or may be misconfigured. Check:

AWS Console → Batch → Compute environments → ThingCertificateRegister: verify the environment is ENABLED and VALID.
Check whether the EC2 service quota for the instance type has been reached in your region.

Step 2: Check for Batch job timeout

If the job status is FAILED and the job ran for about 10 hours, it most likely hit the timeout. This typically happens with very large batches (tens of thousands of nodes).

Fix:

Split the CSV into smaller batches and register each separately.
The recommended batch size is 10,000–20,000 nodes per job.
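
Splitting can be scripted so that every part keeps the original header row. The sketch below is one way to do it; `split_csv` and the part-file naming are hypothetical, and the default of 10,000 rows is the lower end of the recommended range above.

```python
import csv
from pathlib import Path


def split_csv(source_csv, out_dir, batch_size=10000):
    """Write source_csv as numbered part files of at most batch_size data rows each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(source_csv, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)
    parts = []
    for start in range(0, len(rows), batch_size):
        part = out / f"node_certs_part{start // batch_size + 1:03d}.csv"
        with open(part, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)  # every part repeats the header row
            writer.writerows(rows[start:start + batch_size])
        parts.append(part)
    return parts
```

Each returned path can then be passed to a separate `register --inputfile` run.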

Step 3: Check CloudWatch for the Batch container logs

Go to CloudWatch → Log groups → /aws/batch/job and find the log stream for the failed job.

Look for:

The last completed_count logged before the job was killed — this tells you how many nodes were registered before the timeout.
Any specific error that caused the container to exit prematurely (e.g., DynamoDB throttling, IoT API rate limits).

Step 4: Check the DynamoDB request record

Check admin_node_registration_requests for the completed_count at the time of failure. Nodes with a lower index than completed_count are registered. Re-register only the remaining nodes using the --force flag.

Nodes Not Visible on the RainMaker Dashboard

Even after a successful bulk registration, nodes may not be visible on the dashboard until the device connects and sends its configuration. There are two distinct cases.

Case A: Node Registered but Never Appeared on Dashboard

Bulk registration creates the IoT Thing and certificate in AWS IoT Core and writes a record to DynamoDB nodes_v3. However, the full device configuration (name, type, firmware version, parameters) is only stored when the device itself publishes its config after first boot.

Step 1: Verify the node exists in DynamoDB

Go to AWS Console → DynamoDB → Tables → nodes_v3.

Query with partition key node_id.

Entry exists → Node is registered in the system. The dashboard should show it (possibly with limited info until the device publishes config). If it doesn't appear, check admin dashboard permissions.
No entry found → The bulk registration did not complete for this node. Check node_manufacturing_errors for this node ID and re-register it.

Step 2: Check if the node is in the pending registration table

Go to DynamoDB → Tables → admin_pending_registration_nodes.

Query with:

Partition key (user_id): the admin user ID
Sort key (node_id): the node ID

If the entry exists here but not in the admin dashboard view, the dashboard may need a refresh, or the node is awaiting the device to send its first config.

Case B: Device Booted but Node Config Not Updating

After the device boots and connects to MQTT, it should publish its configuration to the topic node/<node_id>/config. This triggers the esp-RegisterDevice Lambda, which stores the device config in DynamoDB.

Step 1: Verify the device published its config

Go to CloudWatch → Logs Insights, select /aws/lambda/esp-RegisterDevice, and run:

fields @timestamp, @message
| filter @message like "<node_id>"
| sort @timestamp desc

Entries found with no errors → Config was received and stored. Refresh the dashboard.
Entries found with errors → Note the error and check the device's config payload format.
No entries found → The device did not publish its config, or the MQTT rule esp_node_config is not routing messages to the SQS queue. See Node Registered but Device Cannot Connect.

Step 2: Check the SQS queue for stuck messages

If the device is publishing but the Lambda is not processing:

Go to AWS Console → SQS → esp-deviceRegisterSQS.
Check Messages available and Messages in flight.
If there are messages in the Dead Letter Queue (esp-FailedMessageDLQ), click Send and receive messages → Poll for messages to inspect them.

Failed messages in the DLQ indicate the esp-RegisterDevice Lambda is failing to process them. Check the Lambda logs for errors.

Step 3: Check the Lambda log for config processing errors

Go to CloudWatch → Logs Insights, select /aws/lambda/esp-RegisterDevice, and run:

fields @timestamp, @message
| filter @message like /error/i or @message like /failed/i
| sort @timestamp desc
| limit 50

Look for JSON parse errors or DynamoDB write failures that could cause the node config to not be stored.

Node Registered but Device Cannot Connect

If the node is registered in DynamoDB and AWS IoT Core, but the physical device cannot establish an MQTT connection:

Step 1: Verify the IoT Thing and certificate exist in AWS IoT Core

Go to AWS Console → IoT Core → Manage → All devices → Things.
Search for the node ID.
Click the thing → go to Certificates tab.
Confirm a certificate is attached and its status is Active.

If the certificate status is Inactive or Revoked, the device cannot connect.

Fix: Activate the certificate:

Click the certificate → Actions → Activate.

If no certificate is attached, the bulk registration may have created the Thing but failed to attach the certificate. Check node_manufacturing_errors for this node.

Step 2: Verify the IoT policy is attached

On the same certificate page, go to the Policies tab. Confirm the esp-rainmaker-iot-policy (or equivalent policy for your deployment) is attached.

If no policy is attached, AWS IoT Core cannot authorize the device: its connection attempt and any publishes or subscribes are denied with an AUTH_ERROR.

Fix: Attach the policy:

Click Actions → Attach policy → select esp-rainmaker-iot-policy.

Step 3: Verify the device is using the correct certificate and key

The certificate (node.crt) and private key (node.key) flashed to the device must come from the same batch that was registered with the cloud. If the device firmware uses different certificate files, it will not be able to authenticate.

Check that the NVS binary (bin/node-<idx>-<node_id>.bin) was flashed to the correct device.

Step 4: Check node connection logs

Go to CloudWatch → Logs Insights, select /aws/lambda/esp-ConnectionNode, and filter by <node_id>:

fields @timestamp, @message
| filter @message like "<node_id>"
| sort @timestamp desc

Look for AUTH_ERROR or FORBIDDEN_ACCESS disconnect reasons, which indicate a certificate or policy issue.

See Debugging Node Connection Issues for a full guide on MQTT connection problems.

Step 5: Verify the device is connecting to the correct MQTT endpoint

The device must connect to the MQTT endpoint of your private RainMaker deployment, not the default Espressif endpoint. Confirm the endpoint.txt file generated during the generate step was used when building the firmware's NVS partition.

Run:

cat <outdir>/<date>/Mfg-<N>/common/endpoint.txt

Compare this with the MQTT host your device is configured to use.
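
The comparison can be scripted so that stray whitespace or letter case does not cause a false mismatch. This is a minimal sketch, assuming endpoint.txt contains only the hostname; `endpoint_matches` is a hypothetical helper.

```python
from pathlib import Path


def endpoint_matches(endpoint_file, device_host):
    """Return True if the host in endpoint_file equals device_host (case-insensitive)."""
    expected = Path(endpoint_file).read_text().strip().lower()
    return expected == device_host.strip().lower()
```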

Error Code Reference

Bulk Node Creation Errors (106xxx)

Error Code	Message	Likely Cause and Fix
106001	Node count should be > 0 and ≤ 10000	Use `--count` between 1 and 10,000 per request
106004	Request ID is not valid	The `request_id` passed to `getcertstatus` is wrong or expired
106007	URL requested is expired	The S3 pre-signed URL timed out (1-hour validity). Re-run `register`
106008	Error fetching pre-signed URL	Backend error. Retry the registration command
106009	File name is missing	Provide `--inputfile` with a valid CSV path
106010	Error submitting thing registration job	AWS Batch job submission failed. Check if the Batch compute environment is healthy
106011	File md5 is missing	The CLI could not compute the MD5 of the CSV. Verify the file is readable
106016	No registration request in progress	No active job for this `request_id`. The job may have already completed or the ID is wrong
106020	Total registered nodes exceeds limit	Deployment's licensed node limit reached. Contact Espressif to increase the quota
106026	Invalid CSV file	The uploaded CSV is malformed. Validate the file format
106031	CSV must have columns: certs, node_id or CN	Ensure the CSV has `node_id` and `certs` columns
106033	Node ID does not match certificate CN	Certificate was generated for a different node ID. Regenerate certificates
106036	Invalid node policy	Valid values: `mqtt`, `videostream`
106037	node_policies cannot be used with update_nodes	Remove `--node_policies` when using `--update_nodes`

Self-Claim / Device Registration Errors (200xxx)

Error Code	Message	Likely Cause and Fix
200001	MAC Address is missing	`mac_addr` not provided to `/claim/node`
200009	Claim does not exist	Node was not pre-claimed or the MAC address lookup failed
200019	Error in creating thing	AWS IoT Core `CreateThing` call failed. Check IAM permissions for the claim Lambda
200020	Certificate is already Provisioned	This certificate is already registered. Use `--force` to re-register
200021	Error in registering certificate	Certificate PEM is invalid or the IoT API returned an error
200022	Invalid Certificate	The certificate data is malformed or expired
200036	Invalid node policy	Valid policies: `mqtt`, `videostream`

CloudWatch Log Groups Reference

Log Group	When to Use
`/aws/lambda/esp-CertificateRegister`	Check registration job submission, pre-signed URL generation, job trigger errors
`/aws/lambda/esp-NodeIdGeneration`	Check node ID generation status when using cloud-based ID generation
`/aws/lambda/esp-RegisterDevice`	Check if node config MQTT message was received and stored
`/aws/lambda/esp-RegisterNode`	Check HTTPS-based node config registration
`/aws/lambda/esp-createAndRegisterThing`	Check self-claim device registration errors
`/aws/lambda/esp-ConnectionNode`	Check device MQTT connect/disconnect events
`/aws/batch/job`	Check detailed per-node logs from the bulk registration Batch container

DynamoDB Tables Reference

Table	When to Check	Key to Query
`admin_node_registration_requests`	Check bulk job status, progress counts	`user_id` (partition), `request_id` (sort)
`node_manufacturing_errors`	Find which specific nodes failed in a batch job	`request_id` (partition), `node_id` (sort)
`nodes_v3`	Verify a node is registered in the system	`node_id` (partition)
`admin_pending_registration_nodes`	Check nodes registered by admin but not yet claimed by a user	`user_id` (partition), `node_id` (sort)
